Method for detecting speech activity

ABSTRACT

A digital speech signal processed by successive frames is subjected to noise suppression taking account of estimates of the noise included in the signal, updated for each frame in a manner dependent on at least one degree of vocal activity. A priori noise suppression is applied to the speech signal of each frame on the basis of estimates of the noise obtained on processing at least one preceding frame, and the energy variations of the a priori noise-suppressed signal are analyzed to detect the degree of vocal activity of said frame.

BACKGROUND OF THE INVENTION

The present invention relates to digital speech signal processing techniques. It relates more particularly to techniques which detect vocal activity to perform different processing according to whether the signal is supporting vocal activity or not.

The digital techniques in question relate to various domains: coding of speech for transmission or storage, speech recognition, noise reduction, echo cancellation, etc.

The main difficulty with vocal activity detection methods is distinguishing vocal activity from the accompanying noise. Conventional noise suppression techniques cannot solve this problem because they themselves use estimates of the noise which depend on the degree of vocal activity of the signal.

A main object of the present invention is to make vocal activity detection methods more robust to noise.

SUMMARY OF THE INVENTION

The invention therefore proposes a method of detecting vocal activity in a digital speech signal processed by successive frames, in which method the speech signal is subjected to noise suppression taking account of estimates of the noise included in the signal, updated for each frame in a manner dependent on at least one degree of vocal activity determined for said frame. According to the invention, a priori noise suppression is applied to the speech signal of each frame on the basis of estimates of the noise obtained on processing at least one preceding frame, and the energy variations of the a priori noise-suppressed signal are analyzed to detect the degree of vocal activity of said frame.

Detecting vocal activity (as a general rule by any method known in the art) on the basis of an a priori noise-suppressed signal significantly improves the performance of detection if the level of surrounding noise is relatively high.

In the remainder of the present description, the vocal activity detection method of the invention is illustrated within a system for eliminating noise from a speech signal. Clearly the method can find applications in many other types of digital speech processing requiring information on the degree of vocal activity of the processed signal: coding, recognition, echo cancellation, etc.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a noise suppression system implementing the present invention;

FIGS. 2 and 3 are flowcharts of procedures used by a vocal activity detector of the system shown in FIG. 1;

FIG. 4 is a diagram representing the states of a vocal activity detection automaton;

FIG. 5 is a graph showing variations in a degree of vocal activity;

FIG. 6 is a block diagram of a module for overestimating the noise of the system shown in FIG. 1;

FIG. 7 is a graph illustrating the computation of a masking curve; and

FIG. 8 is a graph illustrating the use of masking curves in the system shown in FIG. 1.

DESCRIPTION OF PREFERRED EMBODIMENTS

The noise suppression system shown in FIG. 1 processes a digital speech signal s. A windowing module 10 formats the signal s in the form of successive windows or frames each made up of a number N of digital signal samples. In the usual way, these frames can overlap each other. In the remainder of this description, the frames are considered to be made up of N=256 samples with a sampling frequency F_(e) of 8 kHz, with Hamming weighting in each window and with 50% overlaps between consecutive windows, although this is not limiting on the invention.
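The framing performed by the windowing module 10 can be sketched as follows. This is only an illustration under stated assumptions: the function names `hamming` and `split_into_frames` are hypothetical, the standard Hamming coefficients 0.54 and 0.46 are assumed, and incomplete trailing frames are simply dropped.

```python
import math

def hamming(n_samples):
    """Standard Hamming window: w[k] = 0.54 - 0.46*cos(2*pi*k/(N-1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n_samples - 1))
            for k in range(n_samples)]

def split_into_frames(signal, frame_len=256, overlap=0.5):
    """Cut `signal` into Hamming-weighted frames of `frame_len` samples.

    With overlap=0.5 the hop between consecutive frames is frame_len/2.
    Samples that do not fill a complete frame are ignored (an assumption).
    """
    hop = int(frame_len * (1 - overlap))
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + k] * window[k] for k in range(frame_len)])
    return frames
```

With N=256 and 50% overlap, a 1024-sample signal yields seven frames, each starting 128 samples after the previous one.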

The signal frame is transformed into the frequency domain by a module 11 using a conventional fast Fourier transform (FFT) algorithm to compute the modulus of the spectrum of the signal. The module 11 then delivers a set of N=256 frequency components S_(n,f) of the speech signal, where n is the number of the current frame and f is a frequency from the discrete spectrum. Because of the properties of the digital signals in the frequency domain, only the first N/2=128 samples are used.

Instead of using the frequency resolution available downstream of the fast Fourier transform to compute the estimates of the noise contained in the signal s, a lower resolution is used, determined by a number I of frequency bands covering the bandwidth [0,F_(e)/2] of the signal. Each band i (1≦i≦I) extends from a lower frequency f(i−1) to a higher frequency f(i), with f(0)=0 and f(I)=F_(e)/2. The subdivision into frequency bands can be uniform (f(i)−f(i−1)=F_(e)/2I). It can also be non-uniform (for example according to a bark scale). A module 12 computes the respective averages of the spectral components S_(n,f) of the speech signal in bands, for example by means of a uniform weighting such as:

$$S_{n,i} = \frac{1}{f(i) - f(i-1)} \sum_{f \in [f(i-1),\, f(i))} S_{n,f} \qquad (1)$$

This averaging reduces fluctuations between bands by averaging the contributions of the noise in the bands, which reduces the variance of the noise estimator. Also, this averaging greatly reduces the complexity of the system.
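A minimal sketch of the band averaging of equation (1) follows. It assumes that the band edges f(0)…f(I) are expressed as FFT bin indices rather than in hertz, and the function name `band_averages` is hypothetical.

```python
def band_averages(spectrum, edges):
    """Average the spectral moduli S_{n,f} over each band [edges[i-1], edges[i]).

    `spectrum` holds the moduli of one frame (one value per FFT bin) and
    `edges` holds the band boundaries as bin indices (an assumption).
    """
    return [sum(spectrum[lo:hi]) / (hi - lo)
            for lo, hi in zip(edges, edges[1:])]
```

For example, `band_averages([1, 3, 5, 7], [0, 2, 4])` averages bins 0 to 1 and bins 2 to 3 separately, so two band values replace four spectral values.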

The averaged spectral components S_(n,i) are sent to a vocal activity detector module 15 and a noise estimator module 16. The two modules 15, 16 operate conjointly in the sense that degrees of vocal activity γ_(n,i) measured for the various bands by the module 15 are used by the module 16 to estimate the long-term energy of the noise in the various bands, whereas the long-term estimates {circumflex over (B)}_(n,i) are used by the module 15 for a priori suppression of noise in the speech signal in the various bands to determine the degrees of vocal activity γ_(n,i).

The operation of the modules 15 and 16 can correspond to the flowcharts shown in FIGS. 2 and 3.

In steps 17 through 20, the module 15 effects a priori suppression of noise in the speech signal in the various bands i for the signal frame n. This a priori noise suppression is effected by a conventional non-linear spectral subtraction scheme based on estimates of the noise obtained in one or more preceding frames. In step 17, using the resolution of the bands i, the module 15 computes the frequency response Hp_(n,i) of the a priori noise suppression filter from the equation:

$$Hp_{n,i} = \frac{S_{n,i} - \alpha'_{n-\tau1,i} \cdot \hat{B}_{n-\tau1,i}}{S_{n-\tau2,i}} \qquad (2)$$

where τ1 and τ2 are delays expressed as a number of frames (τ1≧1, τ2≧0), and α′_(n,i) is a noise overestimation coefficient determined as explained later. The delay τ1 can be fixed (for example τ1=1) or variable. The greater the degree of confidence in the detection of vocal activity, the lower the value of τ1.

In steps 18 to 20, the spectral components Êp_(n,i) are computed from:

Êp _(n,i)=max{Hp _(n,i) ·S _(n,i) ,βp _(i) ·{circumflex over (B)}_(n−τ1,i)}  (3)

where βp_(i) is a floor coefficient close to 0, used conventionally to prevent the spectrum of the noise-suppressed signal from taking negative values or excessively low values which would give rise to musical noise.

Steps 17 to 20 therefore essentially consist of subtracting from the spectrum of the signal an a priori estimate of the noise spectrum, over-weighted by the coefficient α′_(n−τ1,i).
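Equations (2) and (3) can be illustrated for one band of one frame as follows. This is a sketch only: the function name `a_priori_suppress` and the default floor value 0.01 for βp_(i) are assumptions, the text stating only that the floor is close to 0.

```python
def a_priori_suppress(s_now, s_delayed, noise_prev, alpha_prev, beta_p=0.01):
    """A priori noise suppression for one band (equations (2) and (3)).

    s_now      : S_{n,i}, averaged component of the current frame
    s_delayed  : S_{n-tau2,i} (equal to s_now when tau2 = 0)
    noise_prev : noise estimate B^_{n-tau1,i} from a preceding frame
    alpha_prev : overestimation coefficient alpha'_{n-tau1,i}
    beta_p     : floor coefficient (hypothetical default value)
    """
    hp = (s_now - alpha_prev * noise_prev) / s_delayed   # equation (2)
    return max(hp * s_now, beta_p * noise_prev)          # equation (3)
```

When the over-weighted noise estimate exceeds the signal component, the filter response becomes negative and the floor term takes over, which is the behaviour the floor coefficient is meant to guarantee.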

In step 21, the module 15 computes the energy of the a priori noise-suppressed signal in the various bands i for frame n: E_(n,i)=Êp_(n,i) ². It also computes a global average E_(n,0) of the energy of the a priori noise-suppressed signal by summing the energies for each band E_(n,i), weighted by the widths of the bands. In the following notation, the index i=0 is used to designate the global band of the signal.

In steps 22 and 23, the module 15 computes, for each band i (0≦i≦I), a magnitude ΔE_(n,i) representing the short-term variation in the energy of the noise-suppressed signal in the band i and a long-term value {overscore (E)}_(n,i) of the energy of the noise-suppressed signal in the band i. The magnitude ΔE_(n,i) can be computed from a simplified equation:

$$\Delta E_{n,i} = \left| \frac{E_{n-4,i} + E_{n-3,i} - E_{n-1,i} - E_{n,i}}{10} \right|.$$

As for the long-term energy {overscore (E)}_(n,i), it can be computed using a forgetting factor B1 such that 0<B1<1, namely {overscore (E)}_(n,i)=B1·{overscore (E)}_(n−1,i)+(1−B1)·E_(n,i).
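The two quantities of steps 22 and 23 can be sketched as below. The sketch assumes that the short-term variation is taken in absolute value and that the five most recent energies of a band are kept oldest first; the function names and the default B1=0.95 are hypothetical.

```python
def short_term_variation(e_hist):
    """Short-term energy variation for one band.

    `e_hist` holds [E_{n-4}, E_{n-3}, E_{n-2}, E_{n-1}, E_n], oldest first.
    The absolute value is an assumption about the simplified equation.
    """
    return abs(e_hist[0] + e_hist[1] - e_hist[3] - e_hist[4]) / 10

def long_term_energy(e_bar_prev, e_now, b1=0.95):
    """Recursive smoothing of the energy with forgetting factor 0 < B1 < 1."""
    return b1 * e_bar_prev + (1 - b1) * e_now
```

A value of B1 close to 1 makes the long-term energy follow the instantaneous energy slowly, whereas a smaller value makes it react faster.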

After computing the energies E_(n,i) of the noise-suppressed signal, its short-term variations ΔE_(n,i) and its long-term values {overscore (E)}_(n,i) in the manner indicated in FIG. 2, the module 15 computes, for each band i (0≦i≦I), a value ρ_(i) representative of the evolution of the energy of the noise-suppressed signal. This computation is effected in steps 25 to 36 in FIG. 3, executed for each band i from i=0 to i=I. The computation uses a long-term noise envelope estimator ba_(i), an internal estimator bi_(i) and a noisy frame counter b_(i).

In step 25, the magnitude ΔE_(n,i) is compared to a threshold ε1. If the threshold ε1 has not been reached, the counter b_(i) is incremented by one unit in step 26. In step 27, the long-term estimator ba_(i) is compared to the smoothed energy value {overscore (E)}_(n,i). If ba_(i)≧{overscore (E)}_(n,i), the estimator ba_(i) is taken as equal to the smoothed value {overscore (E)}_(n,i) in step 28 and the counter b_(i) is reset to zero. The magnitude ρ_(i), which is taken as equal to ba_(i)/{overscore (E)}_(n,i) (step 36), is then equal to 1.

If step 27 shows that ba_(i)<{overscore (E)}_(n,i), the counter b_(i) is compared to a limit value bmax in step 29. If b_(i)>bmax, the signal is considered to be too stationary to support vocal activity. The aforementioned step 28, which amounts to considering that the frame contains only noise, is then executed. If b_(i)≦bmax in step 29, the internal estimator bi_(i) is computed in step 33 from the equation:

bi _(i)=(1−Bm)·{overscore (E)} _(n,i) +Bm·ba _(i)  (4)

In the above equation, Bm represents an update coefficient from 0.90 to 1. Its value differs according to the state of a vocal activity detector automaton (steps 30 to 32). The state δ_(n−1) is that determined during processing of the preceding frame. If the automaton is in a speech detection state (δ_(n−1)=2 in step 30), the coefficient Bm takes a value Bmp very close to 1 so the noise estimator is very slightly updated in the presence of speech. Otherwise, the coefficient Bm takes a lower value Bms to enable more meaningful updating of the noise estimator in the silence phase. In step 34, the difference ba_(i)−bi_(i) between the long-term estimator and the internal noise estimator is compared with a threshold ε2. If the threshold ε2 has not been reached, the long-term estimator ba_(i) is updated with the value of the internal estimator bi_(i) in step 35. Otherwise, the long-term estimator ba_(i) remains unchanged. This prevents sudden variations due to a speech signal causing the noise estimator to be updated.
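Steps 25 to 36 can be summarised for a single band by the following sketch. Several points are assumptions and not taken from the description: all numerical defaults (ε1, ε2, bmax, Bmp, Bms) are hypothetical, the comparison of step 34 is made on the absolute difference, and the counter is left unchanged when ΔE_(n,i) reaches ε1. The function name `track_envelope` is also hypothetical.

```python
def track_envelope(delta_e, e_bar, ba, b, prev_state,
                   eps1=0.1, eps2=1.0, bmax=50, bmp=0.999, bms=0.95):
    """One pass of steps 25-36 for one band; returns (rho, ba, b).

    delta_e    : short-term energy variation of the band
    e_bar      : long-term (smoothed) energy of the band
    ba, b      : long-term envelope estimator and noisy frame counter
    prev_state : automaton state of the preceding frame (2 = speech)
    """
    if delta_e < eps1:                        # steps 25-26: stationary frame
        b += 1
    if ba >= e_bar or b > bmax:               # steps 27 and 29 lead to step 28
        ba, b = e_bar, 0
    else:
        bm = bmp if prev_state == 2 else bms  # steps 30-32
        bi = (1 - bm) * e_bar + bm * ba       # step 33, equation (4)
        if abs(ba - bi) < eps2:               # step 34 (absolute value assumed)
            ba = bi                           # step 35
    return ba / e_bar, ba, b                  # step 36
```

The function is meant to be called once per band and per frame, with `ba` and `b` carried over from the preceding call.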

After the magnitudes ρ_(i) have been obtained, the module 15 proceeds to the vocal activity decisions of step 37. The module 15 first updates the state of the detection automaton according to the magnitude ρ₀ calculated for all of the band of the signal. The new state δ_(n) of the automaton depends on the preceding state δ_(n−1) and on ρ₀, as shown in FIG. 4.

Four states are possible: δ=0 detects silence, or absence of speech, δ=2 detects the presence of vocal activity and states δ=1 and δ=3 are intermediate rising and falling states. If the automaton is in the silence state (δ_(n−1)=0) it remains there if ρ₀ does not exceed a first threshold SE1, and otherwise goes to the rising state. In the rising state (δ_(n−1)=1), it reverts to the silence state if ρ₀ is smaller than the threshold SE1, goes to the speech state if ρ₀ is greater than a second threshold SE2 greater than the threshold SE1 and it remains in the rising state if SE1≦ρ₀≦SE2. If the automaton is in the speech state (δ_(n−1)=2), it remains there if ρ₀ exceeds a third threshold SE3 lower than the threshold SE2, and enters the falling state otherwise. In the falling state (δ_(n−1)=3), the automaton reverts to the speech state if ρ₀ is higher than the threshold SE2, reverts to the silence state if ρ₀ is below a fourth threshold SE4 lower than the threshold SE2 and remains in the falling state if SE4≦ρ₀≦SE2.
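The transitions just described can be written as a small state function. The sketch below follows the wording of the paragraph above; the constant names and the function name `next_state` are hypothetical, and the thresholds are passed in by the caller because the description gives no numerical values for them.

```python
SILENCE, RISING, SPEECH, FALLING = 0, 1, 2, 3

def next_state(prev, rho0, se1, se2, se3, se4):
    """Return the new automaton state from the preceding state and rho_0.

    The thresholds are expected to satisfy se2 > se1, se3 < se2 and se4 < se2.
    """
    if prev == SILENCE:
        return RISING if rho0 > se1 else SILENCE
    if prev == RISING:
        if rho0 < se1:
            return SILENCE
        if rho0 > se2:
            return SPEECH
        return RISING
    if prev == SPEECH:
        return SPEECH if rho0 > se3 else FALLING
    # Falling state
    if rho0 > se2:
        return SPEECH
    if rho0 < se4:
        return SILENCE
    return FALLING
```

The two intermediate states give the detector hysteresis: a single frame crossing SE1 or SE3 does not by itself switch the decision between silence and speech.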

In step 37, the module 15 also computes the degrees of vocal activity γ_(n,i) in each band i≧1. This degree γ_(n,i) is preferably a non-binary parameter, i.e. the function γ_(n,i)=g(ρ_(i)) is a function varying continuously in the range from 0 to 1 as a function of the values taken by the magnitude ρ_(i). This function has the shape shown in FIG. 5, for example.

The module 16 calculates the estimates of the noise on a band by band basis, and the estimates are used in the noise suppression process, employing successive values of the components S_(n,i) and the degrees of vocal activity γ_(n,i). This corresponds to steps 40 to 42 in FIG. 3. Step 40 determines if the vocal activity detector automaton has just gone from the rising state to the speech state. If so, the last two estimates {circumflex over (B)}_(n−1,i) and {circumflex over (B)}_(n−2,i) previously computed for each band i≧1 are corrected according to the value of the preceding estimate {circumflex over (B)}_(n−3,i). The correction is done to allow for the fact that, in the rise phase (δ=1), the long-term estimates of the energy of the noise in the vocal activity detection process (steps 30 to 33) were computed as if the signal included only noise (Bm=Bms), with the result that they may be subject to error.

In step 42, the module 16 updates the estimates of the noise on a band by band basis using the equations:

{tilde over (B)} _(n,i)=λ_(B) ·{circumflex over (B)}_(n−1,i)+(1−λ_(B))·S _(n,i)  (5)

{circumflex over (B)} _(n,i)=γ_(n,i) ·{circumflex over (B)}_(n−1,i)+(1−γ_(n,i))·{tilde over (B)} _(n,i)  (6)

in which λ_(B) designates a forgetting factor such that 0<λ_(B)<1. Equation (6) shows that the non-binary degree of vocal activity γ_(n,i) is taken into account.
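The two-stage update of equations (5) and (6) can be sketched for one band as follows. The function name `update_noise` and the default forgetting factor 0.9 are assumptions; the description only requires the factor λ_(B) to lie strictly between 0 and 1.

```python
def update_noise(b_prev, s_now, gamma, lam=0.9):
    """Update the noise estimate of one band (equations (5) and (6)).

    b_prev : preceding estimate B^_{n-1,i}
    s_now  : averaged component S_{n,i} of the current frame
    gamma  : degree of vocal activity gamma_{n,i}, between 0 and 1
    lam    : forgetting factor lambda_B (hypothetical default value)
    """
    b_tilde = lam * b_prev + (1 - lam) * s_now       # equation (5)
    return gamma * b_prev + (1 - gamma) * b_tilde    # equation (6)
```

With γ_(n,i)=1 the preceding estimate is kept unchanged, and with γ_(n,i)=0 the estimate is fully replaced by the smoothed value of equation (5); intermediate values interpolate between the two.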

As previously indicated, the long-term estimates of the noise {circumflex over (B)}_(n,i) are overestimated by a module 45 (FIG. 1) before noise suppression by non-linear spectral subtraction. The module 45 computes the overestimation coefficient α′_(n,i) previously referred to, along with an overestimate {circumflex over (B)}′_(n,i) which essentially corresponds to α′_(n,i)·{circumflex over (B)}_(n,i).

FIG. 6 shows the organisation of the overestimation module 45. The overestimate {circumflex over (B)}′_(n,i) is obtained by combining the long-term estimate {circumflex over (B)}_(n,i) and a measurement ΔB_(n,i) ^(max) of the variability of the component of the noise in the band i around its long-term estimate. In the example considered, the combination is essentially a simple sum performed by an adder 46. It could instead be a weighted sum.

The overestimation coefficient α′_(n,i) is equal to the ratio between the sum {circumflex over (B)}_(n,i)+ΔB_(n,i) ^(max) delivered by the adder 46 and the delayed long-term estimate {circumflex over (B)}_(n−τ3,i) (divider 47), with a ceiling limit value α_(max), for example α_(max)=4 (block 48). The delay τ3 is used to correct the value of the overestimation coefficient α′_(n,i), if necessary, in the rising phases (δ=1), before the long-term estimates have been corrected by steps 40 and 41 from FIG. 3 (for example τ3=3).

The overestimate {circumflex over (B)}′_(n,i) is finally taken as equal to α′_(n,i)·{circumflex over (B)}_(n−τ3,i) (multiplier 49).

The measurement ΔB_(n,i) ^(max) of the variability of the noise reflects the variance of the noise estimator. It is obtained as a function of the values of S_(n,i) and of {circumflex over (B)}_(n,i) computed for a certain number of preceding frames over which the speech signal does not feature any vocal activity in band i. It is a function of the differences |S_(n−k,i)−{circumflex over (B)}_(n−k,i)| computed for a number K of silence frames (n−k≦n). In the example shown, this function is simply the maximum (block 50). For each frame n, the degree of vocal activity γ_(n,i) is compared to a threshold (block 51) to decide if the difference |S_(n,i)−{circumflex over (B)}_(n,i)|, calculated at 52-53, must be loaded into a queue 54 with K locations organised in first-in/first-out (FIFO) mode, or not. If γ_(n,i) does not exceed the threshold (which can be equal to 0 if the function g( ) has the form shown in FIG. 5), the FIFO 54 is not loaded; otherwise it is loaded. The maximum value contained in the FIFO 54 is then supplied as the measured variability ΔB_(n,i) ^(max).

The measured variability ΔB_(n,i) ^(max) can instead be obtained as a function of the values S_(n,f) (not S_(n,i)) and {circumflex over (B)}_(n,i). The procedure is then the same, except that the FIFO 54 contains, instead of |S_(n−k,i)−{circumflex over (B)}_(n−k,i)| for each of the bands i,

$$\max_{f \in [f(i-1),\, f(i))} \left| S_{n-k,f} - \hat{B}_{n-k,i} \right|.$$

Because of the independent estimates of the long-term fluctuations {circumflex over (B)}_(n,i) and short-term variability ΔB_(n,i) ^(max) of the noise, the overestimator {circumflex over (B)}′_(n,i) makes the noise suppression process highly robust to musical noise.
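The data flow of the overestimation module 45 (FIFO 54, maximum 50, adder 46, divider 47, ceiling 48 and multiplier 49) can be sketched for one band as follows. The class name `NoiseOverestimator`, the queue length K=8 and the use of a caller-supplied `load` flag in place of the threshold comparison of block 51 are assumptions made for the illustration.

```python
from collections import deque

class NoiseOverestimator:
    """Per-band overestimation of the noise (sketch of module 45)."""

    def __init__(self, k=8, alpha_max=4.0):
        self.fifo = deque(maxlen=k)   # queue 54 with K locations
        self.alpha_max = alpha_max    # ceiling of block 48

    def update(self, s_now, b_now, b_delayed, load):
        """Return (alpha', B^') for the current frame.

        s_now     : S_{n,i}
        b_now     : long-term estimate B^_{n,i}
        b_delayed : delayed estimate B^_{n-tau3,i}
        load      : True if block 51 decides to load the FIFO for this frame
        """
        if load:
            self.fifo.append(abs(s_now - b_now))              # blocks 52-53
        delta_max = max(self.fifo, default=0.0)               # block 50
        alpha = min((b_now + delta_max) / b_delayed,          # adder 46, divider 47
                    self.alpha_max)                           # ceiling 48
        return alpha, alpha * b_delayed                       # multiplier 49
```

Because the queue has a fixed length, a large deviation stops influencing the overestimate once K newer values have been loaded.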

The module 55 shown in FIG. 1 performs a first spectral subtraction phase. This phase supplies, with the resolution of the bands i (1≦i≦I), the frequency response H_(n,i) ¹ of a first noise suppression filter, as a function of the components S_(n,i) and {circumflex over (B)}_(n,i) and the overestimation coefficients α′_(n,i). This computation can be performed for each band i using the equation:

$$H_{n,i}^{1} = \frac{\max\left\{ S_{n,i} - \alpha'_{n,i} \cdot \hat{B}_{n,i},\; \beta_{i}^{1} \cdot \hat{B}_{n,i} \right\}}{S_{n-\tau4,i}} \qquad (7)$$

in which τ4 is an integer delay such that τ4≧0 (for example τ4=0). The coefficient β_(i) ¹ in equation (7), like the coefficient βp_(i) in equation (3), represents a floor used conventionally to avoid negative values or excessively low values of the noise-suppressed signal.

In a manner known in the art (see EP-A-0 534 837), the overestimation coefficient α′_(n,i) in equation (7) could be replaced by another coefficient equal to a function of α′_(n,i) and an estimate of the signal-to-noise ratio (for example S_(n,i)/{circumflex over (B)}_(n,i)), this function being a decreasing function of the estimated value of the signal-to-noise ratio. This function is then equal to α′_(n,i) for the lowest values of the signal-to-noise ratio. If the signal is very noisy, there is clearly no utility in reducing the overestimation factor. This function advantageously decreases toward zero for the highest values of the signal-to-noise ratio. This protects the highest energy areas of the spectrum, in which the speech signal is the most meaningful, the quantity subtracted from the signal then tending toward zero.

This strategy can be refined by applying it selectively to the harmonics of the pitch frequency of the speech signal if the latter features vocal activity.

Accordingly, in the embodiment shown in FIG. 1, a second noise suppression phase is performed by a harmonic protection module 56. This module computes, with the resolution of the Fourier transform, the frequency response H_(n,f) ² of a second noise suppression filter as a function of the parameters H_(n,i) ¹, α′_(n,i), {circumflex over (B)}_(n,i), δ_(n), S_(n,i) and the pitch frequency f_(p)=F_(e)/T_(p) computed outside silence phases by a harmonic analysis module 57. In a silence phase (δ_(n)=0), the module 56 is not in service, i.e. H_(n,f) ²=H_(n,i) ¹ for each frequency f of a band i. The module 57 can use any prior art method to analyse the speech signal of the frame to determine the pitch period T_(p), expressed as an integer or fractional number of samples, for example a linear prediction method.

The protection afforded by the module 56 can consist in effecting, for each frequency f belonging to a band i:

$$H_{n,f}^{2} = \begin{cases} 1 & \text{if } \begin{cases} S_{n,i} - \alpha'_{n,i} \cdot \hat{B}_{n,i} > \beta_{i}^{2} \cdot \hat{B}_{n,i} & (8) \\ \text{and } \exists\, \eta \text{ integer} \,/\, \left| f - \eta \cdot f_{p} \right| \leq \Delta f / 2 & (9) \end{cases} \\ H_{n,f}^{1} & \text{otherwise} \end{cases}$$

Δf=F_(e)/N represents the spectral resolution of the Fourier transform. If H_(n,f) ²=1, the quantity subtracted from the component S_(n,f) is zero. In this computation, the floor coefficients β_(i) ² (for example β_(i) ²=β_(i) ¹) express the fact that some harmonics of the pitch frequency f_(p) can be masked by noise, so that there is no utility in protecting them.

This protection strategy is preferably applied for each of the frequencies closest to the harmonics of f_(p), i.e. for any integer η.

If δf_(p) denotes the frequency resolution with which the analysis module 57 produces the estimated pitch frequency f_(p), i.e. if the real pitch frequency is between f_(p)−δf_(p)/2 and f_(p)+δf_(p)/2, then the difference between the η-th harmonic of the real pitch frequency and its estimate η×f_(p) (condition (9)) can go up to ±η×δf_(p)/2. For high values of η, the difference can be greater than the spectral half-resolution Δf/2 of the Fourier transform. To take account of this uncertainty, and to guarantee good protection of the harmonics of the real pitch, each of the frequencies in the range [η×f_(p)−η×δf_(p)/2, η×f_(p)+η×δf_(p)/2] can be protected, i.e. condition (9) above can be replaced with:

∃η integer/|f−η·f _(p)|≦(η·δf _(p) +Δf)/2  (9′)

This approach (condition (9′)) is of particular benefit if the values of η can be high, especially if the process is used in a broadband system.
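The harmonic test of condition (9′) can be sketched as a search over the harmonic number η. The sketch assumes that only harmonics η≧1 are considered and that δf_(p) is smaller than 2·f_(p), so that the search terminates; the function name `is_near_harmonic` is hypothetical. With δf_(p)=0 the test reduces to the narrower condition |f−η·f_(p)|≦Δf/2.

```python
def is_near_harmonic(f, f_p, delta_f, delta_fp=0.0):
    """Test whether some integer eta >= 1 satisfies condition (9').

    f        : frequency under test, in Hz
    f_p      : estimated pitch frequency, in Hz
    delta_f  : spectral resolution of the Fourier transform, Fe/N
    delta_fp : resolution of the pitch estimate (0 gives the narrower test)
    """
    if f_p <= delta_fp / 2:
        raise ValueError("pitch frequency must exceed delta_fp / 2")
    eta = 1
    # The lower edge of the protected range grows with eta, so the search
    # can stop as soon as that edge lies above f.
    while eta * f_p - (eta * delta_fp + delta_f) / 2 <= f:
        if abs(f - eta * f_p) <= (eta * delta_fp + delta_f) / 2:
            return True
        eta += 1
    return False
```

With F_(e)=8 kHz and N=256, Δf is 31.25 Hz. A frequency 80 Hz away from the 20th harmonic of a 100 Hz pitch is rejected when δf_(p)=0 but accepted when δf_(p)=10 Hz, which illustrates why the wider range matters for high values of η.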

For each protected frequency, the corrected frequency response H_(n,f) ² can be equal to 1, as indicated above, which in the context of spectral subtraction corresponds to the subtraction of a zero quantity, i.e. to complete protection of the frequency in question. More generally, this corrected frequency response H_(n,f) ² could be taken as equal to a value from 1 to H_(n,f) ¹ according to the required degree of protection, which corresponds to subtracting a quantity less than that which would be subtracted if the frequency in question were not protected.

The spectral components S_(n,f) ² of a noise-suppressed signal are computed by a multiplier 58:

S _(n,f) ² =H _(n,f) ² ·S _(n,f)  (10)

This signal S_(n,f) ² is supplied to a module 60 which computes a masking curve for each frame n by applying a psychoacoustic model of how the human ear perceives sound.

The masking phenomenon is a well-known principle of the operation of the human ear. If two frequencies are present simultaneously, it is possible for one of them not to be audible. It is then said to be masked.

There are various methods of computing masking curves. The method developed by J. D. Johnston can be used, for example (“Transform Coding of Audio Signals Using Perceptual Noise Criteria”, IEEE Journal on Selected Areas in Communications, Vol. 6, No. 2, February 1988). That method operates in the bark frequency scale. The masking curve is seen as the convolution of the spectrum spreading function of the basilar membrane in the bark domain with the exciter signal, which in the present application is the signal S_(n,f) ². The spectrum spreading function can be modelled in the manner shown in FIG. 7. For each bark band, the contribution of the lower and higher bands convolved with the spreading function of the basilar membrane is computed from the equation:

$$C_{n,q} = \sum_{q'=0}^{q-1} \frac{S_{n,q'}^{2}}{\left(10^{10/10}\right)^{(q-q')}} + \sum_{q'=q+1}^{Q} \frac{S_{n,q'}^{2}}{\left(10^{25/10}\right)^{(q'-q)}} \qquad (11)$$

in which the indices q and q′ designate the bark bands (0≦q,q′≦Q) and S_(n,q′) ² represents the average of the components S_(n,f) ² of the noise-suppressed exciter signal for the discrete frequencies f belonging to the bark band q′.

The module 60 obtains the masking threshold M_(n,q) for each bark band q from the equation:

 M _(n,q) =C _(n,q) /R _(q)  (12)

in which R_(q) depends on whether the signal is relatively more or relatively less voiced. As is well-known in the art, one possible form of R_(q) is:

10·log₁₀(R _(q))=(A+q)·χ+B·(1−χ)  (13)

with A=14.5 and B=5.5. χ designates a degree of voicing of the speech signal, varying from 0 (no voicing) to 1 (highly voiced signal). The parameter χ can be of the form known in the art:

$$\chi = \min\left\{ \frac{SFM}{SFM_{\max}},\, 1 \right\} \qquad (12)$$

where SFM represents the ratio in decibels between the arithmetic mean and the geometric mean of the energy of the bark bands and SFM_(max)=−60 dB.
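The masking threshold computation (the spreading sum of equation (11), the offset R_(q) and the threshold M_(n,q)=C_(n,q)/R_(q)) can be sketched as follows. The degree of voicing χ is taken as an input rather than computed from the SFM, the list index is assumed to be the bark band number q starting at 0, and the function name `masking_thresholds` is hypothetical.

```python
def masking_thresholds(s2, chi, a=14.5, b=5.5):
    """Return the masking thresholds M_{n,q} for all bark bands.

    s2  : per-band averages of the noise-suppressed components, indexed by q
    chi : degree of voicing, from 0 (no voicing) to 1 (highly voiced)
    """
    thresholds = []
    for q in range(len(s2)):
        # Contribution of lower bands: 10 dB of attenuation per bark band.
        lower = sum(s2[qp] / (10 ** (10 / 10)) ** (q - qp) for qp in range(q))
        # Contribution of higher bands: 25 dB of attenuation per bark band.
        upper = sum(s2[qp] / (10 ** (25 / 10)) ** (qp - q)
                    for qp in range(q + 1, len(s2)))
        # Offset R_q, given in decibels by (A + q)*chi + B*(1 - chi).
        r_q = 10 ** (((a + q) * chi + b * (1 - chi)) / 10)
        thresholds.append((lower + upper) / r_q)
    return thresholds
```

As in equation (11), a band does not contribute to its own masking term; only its neighbours do, with a slower decay toward higher bands than toward lower ones.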

The noise suppression system further includes a module 62 which corrects the frequency response of the noise suppression filter as a function of the masking curve M_(n,q) computed by the module 60 and the overestimates {circumflex over (B)}′_(n,i) computed by the module 45. The module 62 decides which noise suppression level must really be achieved.

By comparing the envelope of the noise overestimate with the envelope formed by the masking thresholds M_(n,q), a decision is taken to suppress noise in the signal only to the extent that the overestimate {circumflex over (B)}′_(n,i) is above the masking curve. This avoids unnecessary suppression of noise masked by speech.

The new response H_(n,f) ³, for a frequency f belonging to the band i defined by the module 12 and the bark band q, thus depends on the relative difference between the overestimate {circumflex over (B)}′_(n,i) of the corresponding spectral component of the noise and the masking curve M_(n,q), in the following manner:

$$H_{n,f}^{3} = 1 - \left(1 - H_{n,f}^{2}\right) \cdot \max\left\{ \frac{\hat{B}'_{n,i} - M_{n,q}}{\hat{B}'_{n,i}},\, 0 \right\} \qquad (14)$$

In other words, the quantity subtracted from a spectral component S_(n,f), in the spectral subtraction process having the frequency response H_(n,f) ³, is substantially equal to whichever is the lower of the quantity subtracted from this spectral component in the spectral subtraction process having the frequency response H_(n,f) ² and the fraction of the overestimate {circumflex over (B)}′_(n,i) of the corresponding spectral component of the noise which possibly exceeds the masking curve M_(n,q).
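Equation (14) can be illustrated for one frequency as follows; the function name `masked_gain` is hypothetical.

```python
def masked_gain(h2, b_over, mask):
    """Correct the response H2 using the masking threshold (equation (14)).

    h2     : response H^2_{n,f} of the second noise suppression filter
    b_over : noise overestimate B^'_{n,i} of the band containing f
    mask   : masking threshold M_{n,q} of the bark band containing f
    """
    return 1 - (1 - h2) * max((b_over - mask) / b_over, 0.0)
```

When the masking threshold is at or above the overestimate, the response is 1 and nothing is subtracted. When the threshold is zero, the response is that of the second filter unchanged.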

FIG. 8 illustrates the principle of the correction applied by the module 62. It shows in schematic form an example of a masking curve M_(n,q) computed on the basis of the spectral components S_(n,f) ² of the noise-suppressed signal as well as the overestimate {circumflex over (B)}′_(n,i) of the noise spectrum. The quantity finally subtracted from the components S_(n,f) is that shown by the shaded areas, i.e. it is limited to the fraction of the overestimate {circumflex over (B)}′_(n,i) of the spectral components of the noise which is above the masking curve.

The subtraction is effected by multiplying the frequency response H_(n,f) ³ of the noise suppression filter by the spectral components S_(n,f) of the speech signal (multiplier 64). The module 65 then reconstructs the noise-suppressed signal in the time domain by applying the inverse fast Fourier transform (IFFT) to the samples of frequency S_(n,f) ³ delivered by the multiplier 64. For each frame, only the first N/2=128 samples of the signal produced by the module 65 are delivered as the final noise-suppressed signal s³, after overlap-add reconstruction with the N/2=128 last samples of the preceding frame (module 66).
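The overlap-add reconstruction of module 66 can be sketched as follows for frames with 50% overlap. The function name `overlap_add` is hypothetical, and the tail kept before the first frame is assumed to be zero.

```python
def overlap_add(frames):
    """Rebuild a signal from time-domain frames that overlap by half.

    For each frame, the first half is added to the last half of the
    preceding frame and delivered; the last half is kept for the next frame.
    """
    half = len(frames[0]) // 2
    tail = [0.0] * half          # nothing precedes the first frame (assumption)
    out = []
    for frame in frames:
        out.extend(a + b for a, b in zip(frame[:half], tail))
        tail = frame[half:]
    return out
```

Each call delivers N/2 samples per frame, so the output lags the input by half a frame.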

What is claimed is:
 1. Method of detecting vocal activity in a digital speech signal processed by successive frames, comprising the steps of: applying a priori noise suppression to the speech signal of each frame on the basis of noise estimates representative of noise included in the signal, said noise estimates being obtained on processing at least one preceding frame; analyzing energy variations of the a priori noise-suppressed signal to detect at least one degree of vocal activity of said frame; and updating said noise estimates in a manner dependent on said at least one degree of vocal activity detected for said frame.
 2. Method according to claim 1, wherein each degree of vocal activity is a non-binary parameter.
 3. Method according to claim 2, wherein each degree of vocal activity is a function which varies in a continuous manner in the range from 0 to 1.
 4. Method according to claim 1, wherein the noise estimates are obtained in different frequency bands of the signal, the a priori noise suppression is effected band by band, and a degree of vocal activity is determined for each band.
 5. Method according to claim 1, wherein a noise estimate {circumflex over (B)}_(n,i) is obtained for a frame n in a band of frequencies i in the form: {circumflex over (B)} _(n,i)=γ_(n,i) ·{circumflex over (B)}_(n−1,i)+(1−γ_(n,i))·{tilde over (B)}_(n,i) where {tilde over (B)}_(n,i)=λ_(B) ·{circumflex over (B)} _(n−1,i)+(1−λ_(B))·S _(n,i) where λ_(B) is a forgetting factor in the range from 0 to 1, γ_(n,i) is one of said at least one degree of vocal activity determined for the frame n in the band of frequencies i, and S_(n,i) is an average speech signal amplitude in frame n in band i.
 6. Method according to claim 5, in which the a priori noise-suppressed signal Êp_(n,i) relative to a frame n and a band of frequencies i is of the form: Êp _(n,i)=max{Hp _(n,i) ·S _(n,i) ,βp _(i) ·{circumflex over (B)} _(n−τ1,i)} where

$$Hp_{n,i} = \frac{S_{n,i} - \alpha'_{n-\tau1,i} \cdot \hat{B}_{n-\tau1,i}}{S_{n-\tau2,i}},$$

τ1 is an integer at least equal to 1, τ2 is an integer at least equal to 0, α′_(n−τ1,i) is an overestimation coefficient determined for the frame n−τ1 and the band i, and βp_(i) is a positive coefficient.
 7. Method according to claim 1, wherein the step of analysing the energy variations comprises estimating a long-term estimate of the energy of the a priori noise-suppressed signal and comparing said long-term estimate with an instantaneous estimate of said energy, computed over a current frame, to obtain one of said at least one degree of vocal activity of said frame.
 8. Voice activity detector for detecting vocal activity in a digital speech signal processed by successive frames, comprising: means for applying a priori noise suppression to the speech signal of each frame on the basis of noise estimates representative of noise included in the signal, said noise estimates being obtained on processing at least one preceding frame; means for analyzing energy variations of the a priori noise-suppressed signal to detect at least one degree of vocal activity of said frame; and means for updating said noise estimates in a manner dependent on said at least one degree of vocal activity detected for said frame.
 9. Voice activity detector according to claim 8, wherein each degree of vocal activity is a non-binary parameter.
 10. Voice activity detector according to claim 9, wherein each degree of vocal activity is a function which varies in a continuous manner in the range from 0 to 1.
 11. Voice activity detector according to claim 8, wherein the noise estimates are obtained in different frequency bands of the signal, the means for applying a priori noise suppression to the speech signal operate band by band, and a degree of vocal activity is determined for each band.
 12. Voice activity detector according to claim 8, wherein the means for analyzing the energy variations comprises means for estimating a long-term estimate of the energy of the a priori noise-suppressed signal and means for comparing said long-term estimate with an instantaneous estimate of said energy, computed over a current frame, to obtain one of said at least one degree of vocal activity of said frame.